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We introduce property-independent kernels for machine learning modeling of arbitrarily many 
molecular properties. The kernels encode molecular structures for training sets of varying size, as 
well as similarity measures sufficiently diffuse in chemical space to sample over all training molecules. 
Corresponding molecular reference properties provided, they enable the instantaneous generation of 
ML models which can systematically be improved through the addition of more data. This idea is 
exemplified for single kernel based modeling of internal energy, enthalpy, free energy, heat capacity, 
polarizability, electronic spread, zero-point vibrational energy, energies of frontier orbitals, HOMO- 
LUMO gap, and the highest fundamental vibrational wavenumber. Models of these properties are 
trained and tested using 112 kilo organic molecules of similar size. Resulting models are discussed 
as well as the kernels’ use for generating and using other property models. 
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I. INTRODUCTION 

Strategies for solving computational chemistry 
problems have evolved in parallel with the capac¬ 
ity and abundance of computer hardware [T] . Access 
to ever-increasing compute power available at cen¬ 
tralized computer facilities, such as the sciCORE at 
the University of Basel, the Swiss National Super¬ 
computing Centre or the Argonne Leadership Com¬ 
puting Facility, enable “Big Data” driven computa¬ 
tional chemistry which no longer relies on experimen¬ 
tal data for training and validation, but rather on 
virtual data, obtained through predictive modeling 
and massive simulation efforts. Statistical inference 
from Big Data holds great promise in many scientific 
domains, including biology [5], climate research [3], 
high-energy physics |1] , or photonics . Due to their 
unrivaled computational efficiency (typical execution 
time is milliseconds), data driven models of molecu¬ 
lar property predictions become relevant as soon as 
they reach the accuracy of well established deductive 
quantum chemistry methods for solving approxima¬ 
tions to the electronic Schrodinger equation, such as 
Hartree-Fock, Density Functional Theory (DFT), or 
Coupled-Cluster methods. 

For the “supervised learning” task |6] of inferring 
a molecular property from structure-property dyads 
provided a priori, machine learning (ML) algorithms 
have very recently been shown to reach desirable 
quantum chemical accuracy, even when predicting 
properties for new (out-of-sample) molecules which 
had no part in training IMU. These developments 


have also triggered studies on transition state divid¬ 
ing surfaces [12], orbital-free kinetic density func¬ 
tionals m, electronic properties of crystals na, 
transmission coefficients in nano-ribbon models m, 
or densities of states in Anderson impurity mod¬ 
els m- In order to establish a consistent dataset for 
which ML models can be improved systematically 
through the addition of more data, we recently pub¬ 
lished computed DFT structures and multiple prop¬ 
erties of 134 thousand (134k) organic molecules |17| . 
Within a previous study m, discussed at the Swiss 
Chemical Society Fall Meeting 2014, we used some 
properties of this dataset to investigate the A-ML 
approach that augments less expensive deductive 
baseline theories by inductive ML models, trained 
on the baseline’s deficiencies. We demonstrated that 
chemical accuracy can be reached within this A-ML 
approach, as well as strong transferability when ap¬ 
plied to all the 134 k molecules. In this study, we 
investigate the use of property-independent kernels 
for the simultaneous modeling of multiple properties 
taken from the same data base m- To this end, we 
rely on kernel functions sufficiently diffuse to account 
for significant similarity measures among all training 
molecules. This enables us to reapply inverted kernel 
matrices to any arbitrary set of molecular properties 
and to generate the corresponding ML models on the 
same footing. We have validated our approach by si¬ 
multaneous training and prediction of 13 energetic 
and electronic molecular properties. 

This article is organized as follows: Section 2 
Methods briefly summarizes the ML notations and 
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definitions, along with a discussion of kernel function 
shape and spread. In Section 3 Computational De¬ 
tails we describe our molecular data selection strat¬ 
egy, and discuss the selection of properties in the 
dataset. We present and analyze our results for the 
performance of property-independent kernels in Sec¬ 
tion 4. In Section 5 we draw our conclusions. In 
the appendix we explain how to access and reuse the 
kernel data. 

II. METHODS 

Inarguably, one of the more appealing ML algo¬ 
rithms is kernel-ridge-regression (KRR) [T^ because 
of its numerical robustness and conceptual simplic¬ 
ity. Within KRR, the ML Ansatz for a given prop¬ 
erty p of any query molecule, q, is merely a linear 
combination of similarity measures between q and a 
finite set of N training molecules t, 

N 

Pq=^<^Kqt- ( 1 ) 

In Eq. Q Kqt is a kernel matrix element correspond¬ 
ing to molecules q and t. Here, we wish to investigate 
if Kqt can be made independent of p. We have chosen 
to use the exponentially decaying (a.k.a. Laplacian) 
kernel function of similarity measure between t and 
q, Kqt = exp {-Dtqla), where Dtq = |dt - d,] is the 
Manhattan (a.k.a. Li) norm of difference between 
two, typically non-scalar, descriptors of molecules t 
and q, respectively. The global hyperparameter a 
quantifies the kernel width m- For the Coulomb 
matrix (CM) descriptor, the combination of Lapla¬ 
cian kernel with Li norm has been shown to yield 
good ML models of atomization energies El- 

Prior to predicting molecule g’s property Pq ac¬ 
cording to Eq. (0, the vector c^, with optimal re¬ 
gression coefficient cf as element for every training 
molecule t, must be obtained through minimization 
of the penalized Lagrangian function, 

£ = (p’' - KcP)"^ (p’' - KcP) -f XcP'^KcP, (2) 

where matrices are in upper cases, vectors in lower 
case, and ()"'’ denotes a transpose, p’' represents the 
vector with reference property values for all N train¬ 
ing molecules, and K is the kernel matrix with above 
mentioned elements. The A-term imposes regulariza¬ 
tion while the first term corresponds to the conven¬ 
tional least square regression. Setting the deriva¬ 
tive of £, with respect to cP, to zero, the coefficients 
which minimize Eq. ([^ can be shown to amount to 

cP = (K + AI)-^pL (3) 


The dependence of a model’s performance on hy¬ 
perparameters, a, and A, can be understood as fol¬ 
lows. In the presence of training molecules with ex¬ 
tremely outlying properties an optimal value of A be¬ 
comes non-zero in order to quench excessively large 
elements of qP. In other words, the modeling func¬ 
tion becomes more rigid and lessens the danger of 
overfitting. The meaning of the kernel width, cr, is 
to control the cf contribution from training molecule 
t when making a new prediction, see Eq. ( 0 - Typi¬ 
cally, optimal choices of cr, and A are obtained for ev¬ 
ery molecular property and trainingset size through 
extensive use of cross-validation (CV) within train¬ 
ing molecules. CV can become a computational bot¬ 
tleneck for larger training sets. For example, for 
training sets of TV = 10 k, a 5-fold CV implies to re¬ 
peatedly invert 8 kx8 k matrices, each requiring ^0.5 
CPU hours on modern compute hardware. 

In this study, we have explored the possibility to 
always keep all training molecules fixed, to estimate 
the hyperparameter cr beforehand, and to set A to 
zero. This allows us to obviate all CVs, and to use 
a single identical kernel matrix K for any proper¬ 
ties. More specifically, once K~^ has been computed 
and stored, regression coefficients for any number of 
properties, pi,p 2 , ■■■,Pn, can be computed simultane¬ 
ously, 

^ C = K-1P^ 

( 4 ) 

where C and P'' are the N x n regression coefficient 
and property matrices, respectively. Consequently, 
instead of n CVs with computationally demanding 
multiple kernel matrix inversions, each scaling as 
0{N^), the computational cost is now being dom¬ 
inated by one kernel inversion plus n matrix-vector 
multiplications, each scaling as 0{N'^). 

III. COMPUTATIONAL DETAILS 

For typical molecular datasets, with N > 5 k, we 
find that optimal a converge towards a large value 
which ensures non-vanishing contributions from all 
training compounds, while A converges towards zero. 

The choice of setting A to zero is easily justified: 

As also seen below, property values in the molecu¬ 
lar dataset show distributions centered at an aver¬ 
age value with relatively few outliers, if any. Over¬ 
fitting, leading to large coefficients for these outliers, 
is therefore unlikely to influence the performance 
of ML models based on thousands of well-behaved 
training molecules. For the Laplacian kernel used 
here, extreme (too large/small) values of a lead to 
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Property [max. abs. value] 


Figure 1. Density distributions of thirteen properties for the set of 112 k organic molecules made up of CHONF. The 
abscissa corresponds to values of the properties relative to corresponding maximal absolute values in the dataset: 
|Lio|max = 2608.5 kcal/mol, |t/|max = 2626.4 kcal/mol, |Ff|inax = 2643.0 kcal/mol, IGlmax = 2417.1 kcal/mol, 
ia|max = 196.6 flo, |G„|max =47.0 Cal/mol/K, |/r|niax = 30.0 D, |{i7^)|max = 3375 Oo, |ZPVE|niax = 171.9 kcal/mol, 
IehomoI max 10.78 eV, |£lumo| max 4.68 eV, lAel 

max 12.57 eV, and |a;i| 

max =3876.7 cm 


loss of information regarding training descriptor dis¬ 
tances, Dij. For CT « 0, off-diagonal elements of 
the kernel matrix, Kij = exp {—Dij/a) vanish, re¬ 
sulting in a unit-kernel-matrix, K = I. On the other 
hand, for cr >> 1, a kernel-matrix of ones is obtained 
which would be singular (and hence non-invertible) 
for TV > 1, and anyways does not resolve Dij. Here, 
we define the optimal a value to be defined such 
that all kernel elements are in between 0.5 and 1. 
This can be accomplished through the constraint for 
the smallest kernel matrix element, i.e. the kernel 
element which corresponds to the two most distant 
training molecules, 

exp (-£>““/cropt) = 1/2, (5) 

resulting in Copt = log(2). As such, through 

use of CTopt and A = 0, a property-independent global 
kernel matrix is obtained which only needs to be 
inverted once before it is used to generate regres¬ 
sion coefficients for all molecular properties of in¬ 
terest. For randomly sampled Ik molecules, from 
the dataset considered in this study {vide infra), 
Dfj^^ = 677 a.u., suggesting tJopt ~ 977 a.u. This 
number is consistent with a numerical grid search for 
iJopt which identifies cr = 1000 a.u. to be optimal for 
a Ik ML model of atomization enthalpies. In the 


remainder of this study we discuss the performance 
of ML models based on inverted global kernels for up 
to 13 molecular properties, always with (Topt = 1000 
a.u. and A = 0, irrespective of the training set size 
N. 

We have also investigated if our observations de¬ 
pend on the choice of molecular descriptor, d. To 
this end, we have considered two different descrip¬ 
tors, namely the Coulomb-matrix (CM) with rows 
and columns uniquely permuted j9], as well as the 
bag-of-bonds (BOB) descriptor, amounting to 
an ordered set of weighted interatomic distances. 

We have used the published m quantum chem¬ 
istry results for the smallest 133885 (134 k) organic 
molecules subset of the GDB-17 published by Rey- 
mond et al. m, which contains over 166giga 
molecules. This 134 k dataset contains relaxed ge¬ 
ometries and chemical properties computed using the 
DFT (B3LYP with basis set 6-31G(2df,p)). Here, we 
have pruned this dataset by eliminating all molecules 
with up to 8 “heavy” atoms (not counting hydro¬ 
gens) which, trivially, would be outliers since the 
dataset is dominated by molecules with 9 heavy 
atoms. For the resulting 111594 (112 k) molecules, 
we have considered the following 13 computed prop¬ 
erties: Zero-Kelvin internal energy Uq, thermochem- 
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istry energetics at 298.15 K (internal energy U, en¬ 
thalpy H, free energy G, all three properties for 
the process of atomization, and heat capacity C^); 
isotropic molecular polarizability, a; electronic ra¬ 
dial expectation value, harmonic zero-point vi¬ 

brational energy, ZPVE; energy of highest occupied 
molecular orbital (HOMO), enOMo; energy of low¬ 
est unoccupied molecular orbital (LUMO), Elumo; 
HOMO-LUMO gap, Ae; and the highest fundamen¬ 
tal vibrational wavenumber, wi, in the 3000-3900 
cm-i range. 

Fig. features the density distributions of the rel¬ 
ative values of these properties, scaled with respect 
to their maximal values in the dataset. We note 
the properties with large property density, polariz¬ 
ability, dipole moment, energy of HOMO, and radial 
expectation value. Compared to these properties, 
all thermochemical properties are less densely dis¬ 
tributed. Properties even more sparsely distributed 
include LUMO energy, gap and ZVE; interestingly 
their densities also exhibit multimodal distributions 
(see Fig.[^, possibly arising from characteristic func¬ 
tional group moieties present in the dataset. The 
distribution of wi shows three narrow peaks, which 
can readily be interpreted as arising from C-H, N- 
H, and 0-H (symmetric and asymmetric) stretching 
modes. Corresponding values from literature |22| for 
similar wavenumbers at the same level of theory read 
for CH 4 (Ai, 3038, and T 2 , 3152 cm"!), NH 3 (Ai, 
3459, and E, 3576 cm“^), and H 2 O (Ai, 3802, and 
B 2 , 3906 cm“^). Further details regarding the gene¬ 
sis of the dataset can be found in Ref. [m 


IV. RESULTS AND DISCUSSION 

Using single kernels of varying size, the systematic 
decay of ML prediction errors is summarized in Fig. 
for all the aforementioned properties. Note that all 
reported error measures refer to out-of-sample pre¬ 
dictions, i.e. for a given training set of size N, errors 
are presented as measured on the remaining 112 k - 
N molecules. To compare the error across differ¬ 
ent properties, irrespective of units and scale, we 
have used the mean absolute error (MAE) relative 
(i.e., RMAE) to desired quantum chemistry accuracy 
norms as a suitable error measure. The target accu¬ 
racy for the thermochemical quantities, and orbital 
energies is the highly coveted “chemical accuracy” 
for energetics, i.e., 1 kcal/mol. For uJi, and ZPVE, 
both within the harmonic approximation, we have 
selected a target accuracy of 10 cm“^. This value is 
slightly larger than the average accuracy of coupled 
cluster method, CCSD(T) + with converged basis 


sets |23| . for predicting harmonic wavenumbers of 
small molecules, as measured by comparison to their 
experimentally determined counterparts. For dipole 
moment and isotropic polarizability, the target accu¬ 
racies employed are 0.1 D, and 0.1 aj), respectively. 
These thresholds are within the uncertainty of pre¬ 
dicted values of these properties at the CCSD level 
of theory [53] . 



N 


Figure 2. Relative mean absolute errors (RMAE) in 
all ML predicted properties of out-of-sample molecules, 
in the 112 k set using the same kernel function of size 
up to V = 40 k. See text for the definition of RMAE 
for respective properties. TOP: Thermochemistry and 
vibrational properties. BOTTOM: Electronic properties. 

The most compelling feature in Fig. is the sys¬ 
tematic decay in RMAEs for all molecular proper¬ 
ties. This amounts to numerical evidence that pre¬ 
dictive ML-models for multiple properties can be 
built using a single kernel matrix with no property- 
specific parametrization. Furthermore, the accuracy 
of these single kernel models can systematically be 
improved through the addition of more training data. 
Among all properties, we note maximal learning 
rates for (i?^)—a measure of diffuseness of the elec¬ 
tron density, and quite possibly more directly linked 
to molecular geometry and composition than the 
other observables. ZPVE exhibits the second best 
learning rate, implying a potential scope to invest 
on building larger ML models for the data-driven, 
rapid and accurate estimations of ZPVE corrections 
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for a multitude of energetics such as barrier heights, 
reaction/dissociation energies, and thermochemistry. 
At the limit of 40 k training molecules, none of the 
13 properties were predicted within an RMAE of 
1 , indicating the need to employ larger training set 
sizes, fj, and uji are the more difficult properties to 
learn as evinced in Fig. We note that despite 
their strongly differing distribution (shown in Fig. 
HOMO and LUMO values have the same learning 
rate and only differ in their off-set. For all prop¬ 
erties, the near-linear learning curves suggest that 
target values of RMAF=1 could be reached if only 
sufficiently large kernels were constructed. Due to 
the logarithmic scaling, however, the necessary com¬ 
putational investment for training set generation will 
grow increasingly prohibitive. We have also found 
such behavior to be independent of descriptor choice: 
When repeating the ML calculations using the new 
descriptor BOB m, instead of the Coulomb ma¬ 
trix, we noted the same trends in error decays with 
slightly better overall performance. 



pairwise relationships between properties H vs. a, 
and H vs. as well as between the correspond¬ 

ing regression coefficients, cj. [Eq. 01- On the one 
hand, a exhibits the familiar lower linear bound in 
iJ, as also mentioned in Ref. |8j and reminiscent of 
the minimal polarizability principle |25| . or the re¬ 
lated maximum hardness principle |26| . By contrast, 
the vs. c“ scatterplot exhibits complete absence 
of correlation merely implying a disk shaped bivari¬ 
ate normal distribution which sets an upper bound 
of (cf)^ + (c“)^, for any training molecule, t. On 
the other hand, and in contrast to a, the electron 
spread, correlates rather poorly with H (see 

bottom panel in Fig. |^. The scatterplot for the 
corresponding ct, however, now shows a character¬ 
istic cross shape, suggesting a mutually complemen¬ 
tary mode of action among training molecules. More 
specifically, training molecules very relevant for mod¬ 
eling one property (i.e., large Ct) are insignificant for 
modeling the other property, and vice versa. Such a 
pattern could possibly arise from a bivariate normal 
distribution bound of and ^ with Lp-norm 
for 0 < P < 1. 

Overall, however, we note that the distribution 
of is governed by the limits imposed through 
CTopt- For the above mentioned trivial case of ultra- 
tight kernels, i.e. in the limit that cr —> 0, we have 
K = = I, and hence = p'’. This implies that 

for the diffuse kernel functions used in this study 
through choice of dopt = log(2), one should 

not expect c to reflect the same trends as proper¬ 
ties. Finally we note the coefficients of all properties 
to show regularized distributions that are peaked at 
zero (i.e. no over-fitting of outliers), giving further 
justification to our choice of A = 0. 


Figure 3. Scatter plots for properties and corresponding 
coefficients (using the 40k kernel). LEFT: Atomization 
enthalpy, H, versus isotropic polarizability, a (TOP), 
and versus radial expectation value, {R^) (BOTTOM), 
respectively. The gray line (y — —0.51 x — 0.02) indi¬ 
cates the linear fit to the lower bound a for any given H. 
RIGHT: Scatterplots of corresponding regression coeffi¬ 
cients, cf vs. Ct (TOP), and cf vs. (BOTTOM). 

All values are in units of maximal absolute property as 
defined in Fig. 

In previous studies we have already analyzed the 
effect on regression coefficients of a specific property 
due to tuning an external parameter m, here we 
can exploit the single kernel Ansatz to directly com¬ 
pare regression coefficients of different properties for 
the same training molecules. To demonstrate this 
point we have considered the 40 k coefficients used 
to predict H, a, and (i?^). Fig. [^illustrates the 


V. CONCLUSION 

We have validated and exploited the fact that ML 
models based on property-invariant kernels in chemi¬ 
cal space provide a consistent framework to learn any 
arbitrary set of global molecular properties on the 
exact same footing. Using quantum chemistry data 
for over 100 k organic molecules, we have presented 
evidence how this facilitates the simultaneous mod¬ 
eling of several properties. Numerical results have 
been discussed for the single kernel based ML model 
of a wide variety of electronic and energetic molec¬ 
ular properties, including vibrational wavenumbers. 
Overall, we have generated 182 kernels of varying 
sizes for this study. To enable the reuse of the ker¬ 
nels for new properties by the community, we have 
made them publicly accessible (see Appendix). Due 
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to the fact that the computationally most demanding 
step in the development of ML models is matrix in¬ 
version, requiring up to 2 CPU days when using 40 k 
training molecules, we include the inverse of the ker¬ 
nel matrices along with the data. This should enable 
the accelerated development of models for new prop¬ 
erties that can be applied to any molecules which fall 
within the of the employed kernel. 



Figure 4. Schematic description of the database access 
and usage. TOP: For various training set sizes between 
Ik, and 40k, inverse of kernel matrices, descriptors, an 
example program, and corresponding indices of the train¬ 
ing molecules in the 134 k dataset [El are archived (see 
Appendix for details). MIDDLE: With the input argu¬ 
ment 1, and input property values, the program calcu¬ 
lates the corresponding c vector. BOTTOM: With the 
input argument 2, c vector computed a priori, and ge¬ 
ometries of query molecules, the program estimates the 
corresponding properties. 
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VII. APPENDIX: USAGE OF KERNELS 

We provide fnll access to all data generated in this 
stndy m- Fig .[^illustrates the organization and ex¬ 
pected nsage of the data. Using inverse kernel matri¬ 
ces provided in the dataset, c vectors (see Eq. 0) of 
new properties can be compnted through a matrix- 
vector operation (see Eq. @)- For this purpose, one 
can use properties reported in Ref. 1171 or compute 
them fresh. It is possible to train a model for a prop¬ 
erty computed at geometries from a theory slightly 
different than the one employed here. In such cases, 
c will account for both changes in theory, and in 
geometries (see discnssions in Ref. 1151) . This c vec¬ 
tor can then be nsed for the estimation of properties 
of qnery molecnles with nine of the C, N, O, and 
F atoms. We cantion the nser that one shonld not 
expect predictive power for molecnles that differ snb- 
stantially from trainingset molecnles in composition 
or geometry. 
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